{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 2: Pre-Model Workflow and Data Preprocessing\n",
    "\n",
    "This notebook provides \"recipes\" for using the scikit-learn Python library to preprocess data before modeling. Each recipe includes explanations, code examples, visualizations, best practices, and common pitfalls."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handling Missing Data\n",
    "\n",
    "In this section, we will explore different strategies for handling missing data using scikit-learn's imputation tools. \n",
    "\n",
    "### Getting ready\n",
    "\n",
    "To begin, we will create a toy dataset composed of random, quantitative data, ten features, and several missing data values randomly spread throughout. We will then store the dataset in a pandas DataFrame() object for better readability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aaca5f76",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Create a larger sample dataset with missing values\n",
    "np.random.seed(2024)  # For reproducibility\n",
    "n_samples = 20\n",
    "n_features = 10\n",
    "\n",
    "# Generate random data\n",
    "data = {\n",
    "    f\"Feature{i+1}\": np.random.uniform(0, 100, n_samples) for i in range(n_features)\n",
    "}\n",
    "\n",
    "# Convert to DataFrame\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Randomly introduce missing values (approximately 20% of the data)\n",
    "for column in df.columns:\n",
    "    mask = np.random.random(n_samples) < 0.2\n",
    "    df.loc[mask, column] = np.nan\n",
    "\n",
    "# Display the DataFrame with missing values\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to do it...\n",
    "\n",
    "The `SimpleImputer` class provides basic strategies for imputing missing values. It can replace missing values using a constant, the mean, median, or most frequent value of each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.impute import SimpleImputer\n",
    "\n",
    "# Initialize the SimpleImputer and set the strategy to \"mean,\" \"median\", or \"most_frequent\"\n",
    "imputer = SimpleImputer(strategy=\"mean\")\n",
    "\n",
    "# Fit and transform the data\n",
    "imputed_data = imputer.fit_transform(df)\n",
    "imputed_df = pd.DataFrame(imputed_data, columns=df.columns)\n",
    "imputed_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `KNNImputer` class uses the k-Nearest Neighbors approach to impute missing values. It considers the nearest neighbors to estimate the missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.impute import KNNImputer\n",
    "\n",
    "# Initialize the KNNImputer\n",
    "knn_imputer = KNNImputer(n_neighbors=2)\n",
    "\n",
    "# Fit and transform the data using the previously defined DataFrame\n",
    "knn_imputed_data = knn_imputer.fit_transform(df)\n",
    "knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)\n",
    "knn_imputed_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `IterativeImputer` class models each feature with missing values as a function of other features, and iteratively estimates missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.experimental import enable_iterative_imputer # Experimental feature requires loading\n",
    "from sklearn.impute import IterativeImputer\n",
    "\n",
    "# Initialize the IterativeImputer\n",
    "iterative_imputer = IterativeImputer()\n",
    "\n",
    "# Fit and transform the data using the previously defined DataFrame\n",
    "iterative_imputed_data = iterative_imputer.fit_transform(df)\n",
    "iterative_imputed_df = pd.DataFrame(iterative_imputed_data, columns=df.columns)\n",
    "iterative_imputed_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scaling Techniques\n",
    "\n",
    "Scaling and normalization are crucial steps in preprocessing data for machine learning models. They ensure that each feature contributes equally to the distance calculations in algorithms like k-NN and SVM.\n",
    "\n",
    "### Getting ready\n",
    "\n",
    "We will use the previously defined `iterative_imputed_df` DataFrame for this recipe so no need to redefine it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to do it...\n",
    "\n",
    "The `StandardScaler` standardizes features by removing the mean and scaling to unit variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Initialize the StandardScaler\n",
    "scaler = StandardScaler()\n",
    "\n",
    "# Fit and transform the data using the iterative imputed DataFrame\n",
    "scaled_data = scaler.fit_transform(iterative_imputed_df)\n",
    "scaled_df = pd.DataFrame(scaled_data, columns=iterative_imputed_df.columns)\n",
    "scaled_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `MinMaxScaler` transforms features by scaling each feature to a given range, often between zero and one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Initialize the MinMaxScaler\n",
    "minmax_scaler = MinMaxScaler()\n",
    "\n",
    "# Fit and transform the data using the iterative imputed DataFrame\n",
    "minmax_scaled_data = minmax_scaler.fit_transform(iterative_imputed_df)\n",
    "minmax_scaled_df = pd.DataFrame(\n",
    "    minmax_scaled_data, columns=iterative_imputed_df.columns\n",
    ")\n",
    "minmax_scaled_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Normalizer` scales individual samples to have unit norm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.preprocessing import Normalizer\n",
    "\n",
    "# Initialize the Normalizer\n",
    "normalizer = Normalizer()\n",
    "\n",
    "# Fit and transform the data using the iterative imputed DataFrame\n",
    "normalized_data = normalizer.fit_transform(iterative_imputed_df)\n",
    "normalized_df = pd.DataFrame(normalized_data, columns=iterative_imputed_df.columns)\n",
    "normalized_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Encoding Categorical Variables\n",
    "\n",
    "Encoding categorical variables is essential for converting non-numeric data into a format that can be used by machine learning algorithms.\n",
    "\n",
    "### Getting ready\n",
    "\n",
    "To begin, we will create a toy dataset composed of random, quantitative data, ten features, and several missing data values randomly spread throughout. We will then store the dataset in a pandas DataFrame() object for better readability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6885a26",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "import numpy as np\n",
    "\n",
    "# Create sample categorical data with 20 records\n",
    "np.random.seed(2024)  # for reproducibility\n",
    "categories = [\"A\", \"B\", \"C\", \"D\"]\n",
    "categorical_data = pd.DataFrame(\n",
    "    {\n",
    "        \"Department\": np.random.choice(categories, size=20),\n",
    "        \"Position\": np.random.choice([\"Junior\", \"Senior\", \"Manager\"], size=20),\n",
    "        \"Location\": np.random.choice([\"NY\", \"SF\", \"LA\", \"CHI\"], size=20),\n",
    "    }\n",
    ")\n",
    "\n",
    "# Display the DataFrame with categorical values\n",
    "display(categorical_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to do it...\n",
    "\n",
    "The `OneHotEncoder` converts categorical values into a one-hot numeric array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "# Initialize the OneHotEncoder\n",
    "onehot_encoder = OneHotEncoder(sparse_output=False)\n",
    "\n",
    "# Fit and transform the data\n",
    "onehot_encoded_data = onehot_encoder.fit_transform(categorical_data)\n",
    "onehot_encoded_df = pd.DataFrame(\n",
    "    onehot_encoded_data, columns=onehot_encoder.get_feature_names_out()\n",
    ")\n",
    "onehot_encoded_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `LabelEncoder` encodes target labels with values between 0 and n_classes-1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "# Initialize the LabelEncoder\n",
    "label_encoder = LabelEncoder()\n",
    "\n",
    "# Create a new DataFrame to store label encoded values\n",
    "label_encoded_df = pd.DataFrame()\n",
    "\n",
    "# Fit and transform each categorical column\n",
    "for column in categorical_data.columns:\n",
    "    label_encoded_df[f\"{column}_encoded\"] = label_encoder.fit_transform(\n",
    "        categorical_data[column]\n",
    "    )\n",
    "label_encoded_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ColumnTransformer` allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Create sample mixed data with 20 records\n",
    "np.random.seed(2024)  # for reproducibility\n",
    "mixed_data = pd.DataFrame(\n",
    "    {\n",
    "        \"Age\": np.random.randint(25, 65, size=20),\n",
    "        \"Salary\": np.round(np.random.normal(60000, 15000, size=20), 2),\n",
    "        \"Experience\": np.random.randint(1, 20, size=20),\n",
    "        \"Department\": np.random.choice([\"IT\", \"HR\", \"Sales\", \"Finance\"], size=20),\n",
    "        \"Position\": np.random.choice([\"Junior\", \"Senior\", \"Manager\"], size=20),\n",
    "    }\n",
    ")\n",
    "\n",
    "# Display the DataFrame with mixed data\n",
    "display(mixed_data)\n",
    "\n",
    "# Initialize the ColumnTransformer\n",
    "column_transformer = ColumnTransformer(\n",
    "    transformers=[\n",
    "        (\"num\", StandardScaler(), [\"Age\", \"Salary\", \"Experience\"]),\n",
    "        (\"cat\", OneHotEncoder(), [\"Department\", \"Position\"]),\n",
    "    ],\n",
    "    remainder=\"passthrough\",\n",
    ")\n",
    "\n",
    "# Fit and transform the data\n",
    "transformed_data = column_transformer.fit_transform(mixed_data)\n",
    "\n",
    "# Get feature names for the transformed columns\n",
    "numeric_cols = [\"Age_scaled\", \"Salary_scaled\", \"Experience_scaled\"]\n",
    "categorical_cols = column_transformer.named_transformers_[\"cat\"].get_feature_names_out(\n",
    "    [\"Department\", \"Position\"]\n",
    ")\n",
    "\n",
    "# Create the transformed DataFrame\n",
    "transformed_df = pd.DataFrame(\n",
    "    transformed_data, columns=numeric_cols + list(categorical_cols)\n",
    ")\n",
    "transformed_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction to Pipelines\n",
    "\n",
    "Pipelines are a simple way to streamline a machine learning workflow by chaining together transformers and estimators.\n",
    "\n",
    "### Getting ready\n",
    "\n",
    "The general syntax for defining a pipeline is as follows:\n",
    "\n",
    "```\n",
    "pipeline = Pipeline(\n",
    "    [(\"name of step\", transformer), (\"name of step\", transformer),…, (“name of step”, estimator]\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to do it...\n",
    "\n",
    "A basic pipeline chains together a sequence of transformations and a final estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# First, separate features and target (assuming last column is target)\n",
    "X = transformed_df.iloc[:, :-1]  # all columns except last\n",
    "y = transformed_df.iloc[:, -1]  # last column\n",
    "\n",
    "# Split the data\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=2024\n",
    ")\n",
    "\n",
    "# Create a pipeline\n",
    "pipeline = Pipeline(\n",
    "    [\n",
    "        (\"imputer\", SimpleImputer(strategy=\"mean\")),  # handle missing values\n",
    "        (\"scaler\", StandardScaler()),  # scale the features\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Fit and transform the training data\n",
    "X_train_transformed = pipeline.fit_transform(X_train)\n",
    "\n",
    "# Transform the test data\n",
    "X_test_transformed = pipeline.transform(X_test)\n",
    "\n",
    "# Create DataFrames with the transformed data (to preserve column names)\n",
    "X_train_transformed = pd.DataFrame(\n",
    "    X_train_transformed, columns=X_train.columns, index=X_train.index\n",
    ")\n",
    "\n",
    "X_test_transformed = pd.DataFrame(\n",
    "    X_test_transformed, columns=X_test.columns, index=X_test.index\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualizing Pipelines\n",
    "\n",
    "Visualizing pipelines can help understand the workflow and ensure all steps are correctly configured."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn import set_config\n",
    "\n",
    "# Set display configuration\n",
    "set_config(display=\"diagram\")\n",
    "\n",
    "# Display the pipeline\n",
    "pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering\n",
    "\n",
    "Feature engineering involves creating new features or modifying existing ones to improve model performance.\n",
    "\n",
    "### Getting ready\n",
    "\n",
    "We will use the previously defined `X_train_transformed` DataFrame for this recipe so no need to redefine it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to do it...\n",
    "\n",
    "The `PolynomialFeatures` transformer generates polynomial and interaction features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# Initialize the PolynomialFeatures\n",
    "poly = PolynomialFeatures(degree=2)\n",
    "\n",
    "# Fit and transform the X_train_transformed data\n",
    "poly_features = poly.fit_transform(X_train_transformed)\n",
    "poly_features_df = pd.DataFrame(\n",
    "    poly_features, columns=poly.get_feature_names_out(X_train_transformed.columns)\n",
    ")\n",
    "poly_features_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `KBinsDiscretizer` discretizes continuous features into k bins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.preprocessing import KBinsDiscretizer\n",
    "\n",
    "# Initialize the KBinsDiscretizer\n",
    "kbins = KBinsDiscretizer(n_bins=3, encode=\"ordinal\", strategy=\"uniform\")\n",
    "\n",
    "# Fit and transform the X_train_transformed data\n",
    "binned_data = kbins.fit_transform(X_train_transformed)\n",
    "binned_df = pd.DataFrame(binned_data, columns=X_train_transformed.columns)\n",
    "binned_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`RFE()` is a powerful technique that recursively removes the least important features based on a specified estimator's importance ranking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.feature_selection import RFE\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Initialize the RFE\n",
    "rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)\n",
    "\n",
    "# Fit the RFE to the X_train_transformed and y_train data\n",
    "rfe.fit(X_train_transformed, y_train)\n",
    "\n",
    "# Get the ranking of features\n",
    "rfe.ranking_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29872f39",
   "metadata": {},
   "source": [
    "`SelectFromModel()` allows users to select features based on their importance weights derived from a given model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08e7c822",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Initialize SelectFromModel with LinearRegression\n",
    "selector = SelectFromModel(\n",
    "    estimator=LinearRegression(),\n",
    "    prefit=False,\n",
    "    threshold=\"mean\",  # Use mean of feature importances as threshold\n",
    ")\n",
    "\n",
    "# Fit the selector\n",
    "selector.fit(X_train_transformed, y_train)\n",
    "\n",
    "# Get selected features\n",
    "selected_features_mask = selector.get_support()\n",
    "\n",
    "# Get feature names that were selected\n",
    "selected_features = X_train_transformed.columns[selected_features_mask].tolist()\n",
    "\n",
    "# Print feature importance scores and selection status\n",
    "feature_importance = pd.DataFrame(\n",
    "    {\n",
    "        \"Feature\": X_train_transformed.columns,\n",
    "        \"Importance\": selector.estimator_.coef_,\n",
    "        \"Selected\": selected_features_mask,\n",
    "    }\n",
    ")\n",
    "feature_importance.sort_values(\"Importance\", key=abs, ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practical Exercise on Data Preprocessing\n",
    "\n",
    "In this section, we will combine all the recipes into a comprehensive pipeline and apply it to the California Housing dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comprehensive Pipeline\n",
    "\n",
    "We will create a pipeline that includes imputation, scaling, encoding, and modeling steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "YOUR CODE HERE\n",
    "\n",
    "# Load the California Housing dataset\n",
    "YOUR CODE HERE\n",
    "\n",
    "# Split the data\n",
    "YOUR CODE HERE\n",
    "\n",
    "# Create a comprehensive pipeline\n",
    "YOUR CODE HERE\n",
    "\n",
    "# Fit the pipeline\n",
    "YOUR CODE HERE\n",
    "\n",
    "# Evaluate the pipeline\n",
    "YOUR CODE HERE"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
